Use MMI not CTC model for alignment #203
base: master
Conversation
Below are some notes I made about results. There is a modest improvement of around 0.3% absolute on test-other from using the MMI rather than the CTC model for alignment.
x = nnet_output.abs().sum().item()
# x - x != 0 is only true when x is NaN or +/-inf, so this reverts to the
# original output whenever the alignment model's forward pass blew up.
if x - x != 0:
    print("Warning: reverting nnet output since it seems to be nan.")
    nnet_output = nnet_output_orig
@GNroy perhaps this is related to the error you had? I found that I'd sometimes get NaNs in the forward pass of the alignment model. I commented out ali_model.eval() as well as making this change, because I suspected it had to do with test-mode batchnorm, but I might have been wrong; I need to test this. It might also relate to float16 usage (or a combination of the two).
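For reference, here is a minimal sketch of a more explicit version of the same guard (the x - x != 0 trick catches both NaN and inf, since inf - inf is NaN); the helper name is illustrative and not part of this PR:

import torch

def safe_ali_output(nnet_output: torch.Tensor,
                    nnet_output_orig: torch.Tensor) -> torch.Tensor:
    # torch.isfinite is False for NaN and +/-inf, covering the same
    # failure modes as the x - x != 0 check above.
    if not torch.isfinite(nnet_output).all():
        print("Warning: reverting nnet output since it contains NaN/inf.")
        return nnet_output_orig
    return nnet_output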
Thanks!
Actually, I resolved my issue.
NaNs were produced by the encoder part (not a loss or softmax problem as I thought before).
It was fixed with some hyperparameter re-tuning. In particular, setting eps=1e-3 for the optimizer helped.
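For concreteness, a minimal sketch of that change (the Adam choice and the dummy module are assumptions; the relevant part is raising eps from its 1e-8 default to 1e-3):

import torch

# Dummy module standing in for the encoder; the optimizer class is an
# assumption. Raising eps bounds the denominator of the Adam update, so
# steps cannot explode when the second-moment estimate is near zero.
model = torch.nn.Linear(80, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-3)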
Results after training with 1 job only (and uncommenting ali_model.eval(), which I doubt matters) were:
vs. the checked-in results from @zhu-han, which were:
... so according to this, it does not really make a difference which model we use for alignment.
Would it make sense to use a pure TDNN/TDNNF/CNN model for alignments? I was investigating alignments from the conformer recently and my feeling was that they weren't perfect (even though the test-clean WER is ~4%) -- i.e., they sometimes seem a bit warped/shifted, but not in a consistent way. I think the self-attention layers allow the model to "cheat" to some extent with the alignments; I don't know if the same happens with RNNs, but I doubt it would happen with local-context models. Unfortunately, I don't have any means of providing a more objective evaluation than showing a screenshot (look closely at the boundaries with silences).
That's interesting, how did you obtain that plot? I am thinking it might be possible, though, if we had a model that was good for alignment, to save 'constraints'
I'll submit a PR with the code that allows computing alignments and visualizing them later. As to data augmentation of alignments, we could extend most transforms to handle it -- I'm pretty sure we can still do speed perturbation, noise mixing, and SpecAugment masks (but probably not the warping). We don't have reverb in Lhotse yet, but it's probably straightforward as well. Batching is possible too, but I think the alignments would need to be a part of Lhotse rather than external to it, so we can process them properly with everything else in the dataloader.
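As a rough illustration of the speed-perturbation case, here is a sketch (not Lhotse API; the function name is made up) of resampling a per-frame alignment to match audio perturbed by a given factor:

import numpy as np

def perturb_alignment(ali: np.ndarray, factor: float) -> np.ndarray:
    # Speeding up by `factor` (e.g. 1.1) shortens the utterance, so the
    # new alignment has round(len(ali) / factor) frames, each copied
    # from the nearest original frame.
    new_len = int(round(len(ali) / factor))
    src_idx = np.minimum((np.arange(new_len) * factor).round().astype(int),
                         len(ali) - 1)
    return ali[src_idx]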
cool!
Regarding this: it's actually weird that the CTC and MMI alimdl would not make a difference. Some time ago, I think I looked at both the CTC and MMI posteriors, and they are very different -- the CTC posteriors are spiky and the MMI posteriors are not (i.e., MMI tends to recognize repeated phone ids, whereas CTC tends to recognize one phone id followed by blanks). Given the way the alimdl's posteriors are added to the main model's posteriors, I'd think it would be important.
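For context, a minimal sketch of one plausible way such blending could look (an additive combination of log-posteriors with a scale; an illustration under assumptions, not necessarily snowfall's exact code):

import torch

def blend_posteriors(main_logprobs: torch.Tensor,
                     ali_logprobs: torch.Tensor,
                     scale: float = 0.5) -> torch.Tensor:
    # Both tensors are assumed to be (N, T, C) log-posteriors; lengths can
    # differ by a frame or two due to subsampling, so truncate to the
    # shorter one before adding the scaled alignment-model output.
    t = min(main_logprobs.shape[1], ali_logprobs.shape[1])
    return main_logprobs[:, :t] + scale * ali_logprobs[:, :t]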
Mm, I would have expected the MMI one to be better; but since we're just using this at the start of training to guide the model towards plausible alignments, it could be that the difference gets lost by the end.
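That intuition could be made explicit with a weight on the alignment model that decays over training; a sketch (the schedule and constant are assumptions, not the code in this PR):

def ali_model_weight(batch_idx: int, warmup_batches: int = 500) -> float:
    # Starts near 1.0 and decays toward 0, so the alignment model only
    # shapes early training; whatever differs between the CTC and MMI
    # alignment models fades out by the end, consistent with the nearly
    # identical final WERs reported above.
    return warmup_batches / (batch_idx + warmup_batches)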